Abstract: The powerful (and so far under-utilized) Goulden-Jackson Cluster method for finding
the generating function for the number of words avoiding, as factors, the members of a
prescribed set of 'dirty words', is tutorialized and extended in various directions. The authors'
Maple implementations, contained in several Maple packages available from this paper's website
http://www.math.temple.edu/~zeilberg/gj.html, are described and explained.



Preface 

In New York City there is a hotel called ESSEX. Once in a while the bulbs of the first two letters of 
its neon sign go out, resulting in the wrong message. This motivates the following problem. Given 
a finite alphabet, and a finite set (lexicon) of 'bad words', find the number of n-lettered words in 
the alphabet that avoid as factors (i.e. strings of consecutive letters) any of the dirty words. More 
generally, count the number of such words with a prescribed number of occurrences of obscenities
(the previous case being that of zero bad words), and even more generally, count how many words there are
with a prescribed number of occurrences of each letter of the alphabet, and a prescribed number
of occurrences of each of the bad words.

Many problems in combinatorics, probability, statistics, computer science, engineering, and the
natural and social sciences, are special cases of, or can be formulated in terms of, the above
scenario. It is a rather well-kept secret that there exists a powerful method, the Goulden-Jackson
Cluster method [GoJ1] [GoJ2], to tackle it.

In this paper we start with a motivated and accessible account of the method, and then we generalize
it in various directions. Most importantly, we describe our Maple implementations of both the
original method and of our various extensions. These packages are obtainable, free of charge, from
this paper's very own website http://www.math.temple.edu/~zeilberg/gj.html.



The Goulden-Jackson Cluster method is very similar to, and in some sense a generalization of, the
method of Guibas and Odlyzko [GuiO], whose main motivation was Penney-ante games. However,
philosophically, psychologically, and conceptually, the Goulden-Jackson and Guibas-Odlyzko
methods are quite distinct, and we find that the former is more suitable for our purposes.

The Naive Approach 

Before describing the Cluster method, let's review the naive approach. First, some notation. Given
a word $w = w_1, \dots, w_n$, a factor (borrowing the term from the theory of formal languages) is any
of the $\binom{n+1}{2}$ words $w_i w_{i+1} \dots w_{j-1} w_j$, for $1 \le i \le j \le n$. For example the factors of JOHN are J,
O, H, N, JO, OH, HN, JOH, OHN, JOHN, while the factors of DORON are D, O, R, O, N, DO,
OR, RO, ON, DOR, ORO, RON, DORO, ORON, DORON. Note that a given word may occur
several times as a factor, for example the one-letter word O in DORON, or the two-letter words
CA and TI in TITICACA. Also as in formal languages, given an alphabet V, we will denote the
set of all possible words in V by $V^*$.

Consider a finite alphabet V with d letters, and suppose that we want to keep track of all factors
of length $\le R+1$, including individual letters. For every word w of length $\le R+1$, introduce a
variable x[w]. All the x[w] commute with each other.

Define a weight on words $w = w_1, \dots, w_n$, by:

$$Weight(w) := \prod_{r=1}^{R+1} \prod_{i=1}^{n-r+1} x[w_i, w_{i+1}, \dots, w_{i+r-1}] .$$

For example, if R = 2, then Weight(SEXY) = x[S]x[E]x[X]x[Y]x[SE]x[EX]x[XY]x[SEX]x[EXY].
The weight of a set of words ('language') $\mathcal{L}$, $Weight(\mathcal{L})$, is defined as the sum of the weights of
all the words belonging to that language. Also, given a language $\mathcal{L}$ and a letter v, we will denote
by $\mathcal{L}v$ the set of words obtained from $\mathcal{L}$ by appending v at the end of each of the words
of $\mathcal{L}$. Thus if $\mathcal{L}$ = {SEX, LOON}, and R = 1, then $Weight(\mathcal{L})$ = x[S]x[E]x[X]x[SE]x[EX] +
x[L]x[O]^2x[N]x[LO]x[OO]x[ON], and $\mathcal{L}Y$ = {SEXY, LOONY}.
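
The weight of a single word is just a tally of its short factors, which is easy to check by machine. The following is a minimal Python sketch (not part of the authors' Maple packages; the function name `weight_exponents` is our own) that returns, for each factor u of length at most R+1, the exponent of x[u] in Weight(w).

```python
from collections import Counter

def weight_exponents(word, R):
    """Exponent of each x[u] in Weight(word): one count per factor of length <= R+1."""
    exps = Counter()
    n = len(word)
    for r in range(1, R + 2):          # factor length r = 1, ..., R+1
        for i in range(n - r + 1):     # 0-based starting position of the factor
            exps[word[i:i + r]] += 1
    return exps

print(weight_exponents("SEXY", 2))     # nine distinct factors, each with exponent 1
print(weight_exponents("LOON", 1))     # x[O] appears squared
```

Summing the exponents for DORON with R = 4 recovers the 15 factors listed earlier.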

The generating function

$$\Phi_R := \sum_{w \in V^*} Weight(w) ,$$

stores all the information about the number of words with a prescribed number of factors of length
$\le R+1$. So, the number of words in $V^*$ that have exactly $n_u$ factors that are u, for each $u \in V^*$ of
$length(u) \le R+1$, is the coefficient in $\Phi_R$ of the monomial $\prod x[u]^{n_u}$, where the product extends
over the set $\{u \in V^* , length(u) \le R+1\}$.

If we want the generating function for the number of words with a prescribed number of bad words
and a prescribed number of letters, we may first compute $\Phi_R$ (where R+1 is the maximum length
of a bad word), and then set x[v] = s for each letter $v \in V$, x[w] = t if w is a bad word,
and x[w] = 1 otherwise. The coefficient of $s^n t^m$ in the resulting generating function would be the
number of n-letter words with exactly m instances of bad words occurring as factors. If we want
the generating function for words with no occurrences of dirty words as factors, we set t = 0.

How to compute $\Phi_R$? For each word $v \in V^*$, of length R, let Sof[v] be the subset of $V^*$ of words
that end with v. Write $v = v_1, \dots, v_R$. Every word in Sof[v] is either v itself or of length > R, in
which case chopping the last letter results in an element of Sof[u], for one of the d u's of the form
$i, v_1, \dots, v_{R-1}$. In symbols

$$Sof[v] = \{v\} \cup \bigcup_{i \in V} Sof[i, v_1, \dots, v_{R-1}] \, v_R . \qquad (SetEq)$$

Since, for any word $w = w_1, \dots, w_n \in V^*$, of length > R,

$$Weight(w_1, \dots, w_n) = Weight(w_1, \dots, w_{n-1}) \cdot \prod_{r=1}^{R+1} x[w_{n-r+1}, \dots, w_n] ,$$

the system of set equations (SetEq) translates to the linear system of (algebraic) equations

$$Weight(Sof[v]) = Weight(v) + \left( \prod_{r=1}^{R} x[v_{R-r+1}, \dots, v_R] \right) \sum_{i \in V} x[i, v_1, \dots, v_R] \, Weight(Sof[i, v_1, \dots, v_{R-1}]) . \qquad (Linear\text{-}Algebra\text{-}Eq)$$

We have a system of $d^R$ linear equations for the $d^R$ unknowns Weight(Sof[w]), $w \in V^*$, length(w) =
R, that obviously has a unique solution (on combinatorial grounds!). Since the coefficients are
polynomials (in fact monomials) in the variables x[w], $w \in V^*$, $length(w) \le R+1$, the solutions
Weight(Sof[v]) must be rational functions in these variables.

After solving the system, we get $\Phi_R$ from

$$\Phi_R = \sum_{w \in V^*,\ length(w) < R} Weight(w) + \sum_{w \in V^*,\ length(w) = R} Weight(Sof[w]) .$$

Since the first sum is a polynomial and the second sum is a finite sum of rational functions, it
follows that $\Phi_R$ is a rational function. Hence every specialization, as described above, is also a
rational function of its variables.
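
To make the bookkeeping concrete, here is a small Python sketch of the naive approach in its most specialized form (x[letter] = s, x[bad word] = 0, all other variables 1). It is not the NAIVE package and it works numerically rather than symbolically: instead of solving for the rational functions Weight(Sof[v]), it propagates, for each of the $d^R$ possible endings v, the number of good words of each length that end with v. The function name `naive_counts` is our own.

```python
from collections import Counter
from itertools import product

def naive_counts(alphabet, bad, N):
    """a(0..N) = number of words avoiding `bad`, tracking only the last R letters,
    where R+1 is the length of the longest bad word."""
    R = max(len(b) for b in bad) - 1
    good = lambda w: not any(b in w for b in bad)
    # Words shorter than R are simply enumerated.
    counts = [sum(good("".join(t)) for t in product(alphabet, repeat=n))
              for n in range(min(R, N + 1))]
    # state[v] = number of good words of the current length that end with v (length(v) = R)
    state = Counter("".join(t) for t in product(alphabet, repeat=R) if good("".join(t)))
    for n in range(R, N + 1):
        counts.append(sum(state.values()))
        nxt = Counter()
        for v, c in state.items():
            for ch in alphabet:
                w = v + ch                       # the last R+1 letters after appending ch
                if not any(w.endswith(b) for b in bad):
                    nxt[w[1:]] += c              # a new bad factor can only end at the last letter
        state = nxt
    return counts

print(naive_counts("AB", ["AA"], 5))
```

The dictionary `state` has up to $d^R$ entries, which is exactly the size of the linear system above.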

The Maple Implementation of the Naive Approach is contained in the package NAIVE. After downloading
it from this paper's webpage to your working directory, go into Maple by typing maple,
followed by [Enter]. Once in Maple, load the package by typing read NAIVE;. To get on-line
help, type ezra();, for a list of the procedures, and ezra(procedure_name);, for instructions on how
to use a specific function. The most important function is PhiR, which computes $\Phi_R$. The function
call is PhiR(Alphabet,R,x), where Alphabet is the set of letters, R is the non-negative integer R,
and x is the variable-name for the indexed variables x[w]. For example, PhiR({1},0,z); should
give 1/(1-z[1]).

The other procedures are Naivegf, Naivest, and Naives, which compute, the long way, what the
procedures GJgf, GJst and GJs of the package DAVID_IAN, to be described shortly, compute fast.
Their main purpose is to check the validity of DAVID_IAN and the other packages described later in
this paper. Readers are advised to use them only out of curiosity.

The Drawback of the Naive Approach 

In order to get the generating function $\sum_{n=0}^{\infty} a(n) s^n$, where a(n) := number of words in $\{A, \dots, Z\}^*$
of length n with no SEX in it (as a factor), we need to solve a system of $26^2$ equations in $26^2$
unknowns, then plug in x[A] = ... = x[Z] = s, x[AA] = ... = x[ZZ] = 1, x[AAA] = ... =
x[ZZZ] = 1, except for x[SEX] = 0. For some economy, we could have made the substitution in
the equations themselves, before solving them, but we would still have to solve a system of that
size.

If we wanted to find the generating function for SEX-less words with an arbitrary size alphabet, then
the above method is not even valid in principle. Luckily, we have the powerful Goulden-Jackson
Cluster method, that can handle such problems very efficiently.

The Most Basic Version of the Goulden-Jackson Cluster Method 

Consider a finite alphabet V, and a finite set of bad words, B. It is required to find a(n) := the
number of words of length n that do not contain, as factors, any of the members of the set of bad
words B. For example if V = {E, S, X}, and B = {SEX, XE}, then a(0) = 1, a(1) = 3, a(2) =
8, a(3) = 20, ....
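
These first few values are small enough to confirm by exhaustive enumeration. The Python sketch below (our own helper `count_avoiding`, usable only for tiny n since it inspects all $d^n$ words) does exactly that.

```python
from itertools import product

def count_avoiding(alphabet, bad, n):
    """Number of n-letter words over `alphabet` with no member of `bad` as a factor."""
    return sum(1 for t in product(alphabet, repeat=n)
               if not any(b in "".join(t) for b in bad))

print([count_avoiding("ESX", ["SEX", "XE"], n) for n in range(4)])  # [1, 3, 8, 20]
```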

Of course we may assume that no bad word of B has a proper factor that is also in B, since then the longer
word would be superfluous, and could be deleted from the set of banned words. For example it is
not necessary to ban both SEX and SEXY, since any word that contains SEXY in it would also
contain SEX, and hence the set of words avoiding SEX and SEXY is identical to the set of words
avoiding SEX.

As is often the case in combinatorics, we compute the generating function $f(s) = \sum_{n=0}^{\infty} a(n) s^n$
rather than a(n) directly. We know from the Naive section that this is a rational function of s, but
this fact will emerge again from the Cluster algorithm, and this time the algorithm is efficient.

The methodology is the venerable Inclusion-Exclusion paradigm that, depending on one's specialty,
is sometimes known as Möbius Inversion and Sieve methods. The essence of the method is to replace
the straight counting of a hard-to-count set of 'good guys' by the weighted count of the much larger
set of pairs

[arbitrary guy, arbitrary subset of his sins],

where the weight is (-1) to the power of the cardinality of the subset of his sins.

Introducing the weight on words $weight(w) := s^{length(w)}$, f(s) is the weight enumerator of the set
of words, $\mathcal{L}(B)$, that avoid the members of B as factors, i.e.

$$f(s) = \sum_{w \in \mathcal{L}(B)} weight(w) .$$

The trick is to add 0 to both sides and rewrite this as

$$f(s) = \sum_{w \in V^*} weight(w) \, 0^{[\text{number of factors of } w \text{ that belong to } B]} ,$$

and then use the following deep facts:

$$0 = 1 + (-1) , \qquad (i)$$

$$0^r = \begin{cases} 1, & \text{if } r = 0; \\ 0, & \text{if } r > 0, \end{cases} \qquad (ii)$$

and for any finite set A,

$$\prod_{a \in A} 0 = \prod_{a \in A} (1 + (-1)) = \sum_{S \subset A} (-1)^{|S|} ,$$

where as usual, |S| denotes the cardinality of S.
We now have,

$$f(s) = \sum_{w \in V^*} weight(w) \, 0^{[\text{number of factors of } w \text{ that belong to } B]}
= \sum_{w \in V^*} weight(w) \, (1 + (-1))^{[\text{number of factors of } w \text{ that belong to } B]}
= \sum_{w \in V^*} \sum_{S \subset Bad(w)} (-1)^{|S|} s^{length(w)} ,$$

where Bad(w) is the set of factors of w that belong to B. For example if B = {SEX, EXE, XES}
and w = SEXES, then Bad(w) consists of the factors SEX (occupying the first three letters),
EXE (occupying letters 2, 3, 4), and XES (occupying the last three letters).

So the desired generating function is also the weight-enumerator of the much larger set consisting of
pairs (w, S), where $S \subset Bad(w)$, and now the weight is defined by $weight(w, S) := (-1)^{|S|} s^{length(w)}$.
Surprisingly, this larger set is much easier to (weight-)count. We may think of its members as 'marked words', where
S denotes the subset consisting of those bad factors that the censor, or teacher, was able to detect.

First, we need a convenient data-structure for these weird objects. Any word w, of length n,
$w = w_1 \dots w_n$, has $\binom{n+1}{2}$ factors $w_i, \dots, w_j$, which we will denote by [i, j], $1 \le i \le j \le n$. Hence any
marked word may be represented by $(w; [i_1, j_1], [i_2, j_2], \dots, [i_l, j_l])$, where $w_{i_r} w_{i_r + 1} \dots w_{j_r - 1} w_{j_r} \in B$
for $r = 1, \dots, l$, and we make it canonical by ordering the $j_r$'s, i.e. we arrange the marked factors
such that $j_1 < j_2 < \dots < j_l$. Since no bad word is a proper factor of another bad word, we can
assume that all the $j_r$'s are distinct, and that there is no nesting.

For example if B = {SEX, EXE, XES}, and w = SEXES, then w gives rise to the following $2^3$
marked words: (SEXES; ), (SEXES; [1,3]), (SEXES; [2,4]), (SEXES; [3,5]), (SEXES; [1,3], [2,4]),
(SEXES; [1,3], [3,5]), (SEXES; [2,4], [3,5]), (SEXES; [1,3], [2,4], [3,5]).
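
The one-dimensional notation translates directly into a data structure. As an illustrative Python sketch (names `occurrences` and `marked_words` are ours), a marked word can be stored as the word together with a tuple of 1-based inclusive intervals [i, j], sorted by right endpoint; all marked words over a given word are then the subsets of its set of bad occurrences.

```python
from itertools import combinations

def occurrences(word, bad):
    """All intervals (i, j), 1-based and inclusive, with word[i..j] in bad, sorted by j."""
    occ = [(i + 1, i + len(b))
           for b in bad
           for i in range(len(word) - len(b) + 1)
           if word.startswith(b, i)]
    return sorted(occ, key=lambda ij: (ij[1], ij[0]))

def marked_words(word, bad):
    """Every pair (word, S) with S a subset of the bad occurrences in word."""
    occ = occurrences(word, bad)
    return [(word, S) for k in range(len(occ) + 1) for S in combinations(occ, k)]

marks = marked_words("SEXES", ["SEX", "EXE", "XES"])
for m in marks:
    print(m)
# The signed count over all marked words of a bad word cancels to 0.
print(sum((-1) ** len(S) for _, S in marks))
```

For a word with no bad factors the only marked word is the unmarked one, so the signed count is 1; this is the inclusion-exclusion identity of the previous paragraphs, word by word.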

For human consumption, it is easier to portray a marked word by a 2-dimensional structure. The
top line is the word itself, and then we list each of the marked factors on a separate line,
from right to left. For example the marked word (SEXES; ) is simply

    S E X E S ,

the marked word (SEXES; [2,4]) is portrayed as

    S E X E S
      E X E

while the marked word (SEXES; [1,3], [2,4], [3,5]) is written as:

    S E X E S
        X E S
      E X E
    S E X

Given a word $w = w_1 \dots w_n$, we will say that two factors [i, j] and [i', j'], with j < j', overlap if
they have at least one common letter, i.e. if $i < i' \le j$.

Let M be the set of these marked words. How to (weight-)count them? Given a non-empty marked
word $(w_1 \dots w_n; [i_1, j_1], [i_2, j_2], \dots, [i_l, j_l])$ there are two possibilities regarding the last letter.

Either $j_l < n$, in which case $w_n$ is not part of any detected bad factor, and deleting it results in
another, shorter, marked word $(w_1 \dots w_{n-1}; [i_1, j_1], [i_2, j_2], \dots, [i_l, j_l])$. We can always restore this
last letter, and there is an obvious bijection between marked words of length n in which $j_l < n$
and pairs (marked word of length n - 1, letter of V).

The other possibility is that $j_l = n$; then we can't simply delete the last letter $w_n$. Let k be the
smallest integer such that $[i_k, j_k]$ overlaps with $[i_{k+1}, j_{k+1}]$, $[i_{k+1}, j_{k+1}]$ overlaps with $[i_{k+2}, j_{k+2}]$,
..., $[i_{l-1}, j_{l-1}]$ overlaps with $[i_l, j_l]$; then removing the last $n - i_k + 1$ letters from w and the last
$l - k + 1$ marked factors results in a pair of marked words $(w_1 \dots w_{i_k - 1}; [i_1, j_1], \dots, [i_{k-1}, j_{k-1}])$
and $(w_{i_k} \dots w_n; [1, j_k - i_k + 1], \dots, [i_l - i_k + 1, j_l - i_k + 1])$. The first of these two marked words
could be arbitrary, but the second one has the special property that each of its letters belongs to
at least one marked factor, and that neighboring marked factors overlap. Let's call such marked
words clusters and denote the set of clusters by C.

For example if V = {E, S, X} and B = {SEX, ESE, XES}, the marked word

    S E X E S E X
            S E X
          E S E
    S E X

which in one-dimensional notation is (SEXESEX; [1,3], [4,6], [5,7]), is not a cluster (since [1,3]
and [4,6] don't overlap), while

    S E X E S E X
            S E X
          E S E
        X E S
    S E X

which in one-dimensional notation is written (SEXESEX; [1,3], [3,5], [4,6], [5,7]), is a cluster.
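
The two defining properties of a cluster are mechanical to test. The following Python sketch (the helper `is_cluster` is our own, and it assumes the marks are already in canonical order with no nesting, as in this section) checks both examples above.

```python
def is_cluster(n, marks):
    """n: length of the underlying word; marks: intervals (i, j), 1-based inclusive,
    sorted by j, no nesting. True if every letter is marked and neighbours overlap."""
    if not marks:
        return False
    covered = set()
    for i, j in marks:
        covered.update(range(i, j + 1))
    every_letter_marked = covered == set(range(1, n + 1))
    neighbours_overlap = all(i2 <= j1 for (_, j1), (i2, _) in zip(marks, marks[1:]))
    return every_letter_marked and neighbours_overlap

print(is_cluster(7, [(1, 3), (4, 6), (5, 7)]))          # False: [1,3] and [4,6] do not overlap
print(is_cluster(7, [(1, 3), (3, 5), (4, 6), (5, 7)]))  # True
```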

Hence any member of M (i.e. marked word) is either empty (weight 1), or ends with a letter that
is not part of a cluster, or ends with a cluster. Peeling off the maximal cluster results in a smaller
marked word (by definition of maximality). Hence we have the decomposition:

$$M = \{empty\_word\} \cup M V \cup M C .$$

Taking weights we have,

$$weight(M) = 1 + weight(M) \, d s + weight(M) \, weight(C) .$$

Since weight(M) = f(s), solving for f(s) yields

$$f(s) = \frac{1}{1 - ds - weight(C)} .$$

It remains to find the weight-enumerator of C, weight(C).

Let's examine how two bad words u and v can be the last two members of a cluster. This happens
when a proper suffix (tail) of u coincides with a proper prefix (head) of v.

For any word $w = w_1 \dots w_n$, let HEAD(w) be the set of all proper prefixes:

$$HEAD(w_1 \dots w_n) := \{ w_1 , w_1 w_2 , w_1 w_2 w_3 , \dots , w_1 w_2 \dots w_{n-1} \} ,$$

and let TAIL(w) be the set of all proper suffixes:

$$TAIL(w_1 \dots w_n) := \{ w_n , w_{n-1} w_n , w_{n-2} w_{n-1} w_n , \dots , w_2 \dots w_n \} .$$

Given two words u and v, define the set $OVERLAP(u, v) := TAIL(u) \cap HEAD(v)$.
For example OVERLAP(PICACA, CACACA) = {CA, CACA}.
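
These three definitions are one-liners in most languages. A minimal Python sketch (function names `head`, `tail`, `overlap` are ours, chosen to mirror HEAD, TAIL and OVERLAP):

```python
def head(w):
    """Set of all proper prefixes of w."""
    return {w[:k] for k in range(1, len(w))}

def tail(w):
    """Set of all proper suffixes of w."""
    return {w[k:] for k in range(1, len(w))}

def overlap(u, v):
    """OVERLAP(u, v) = TAIL(u) intersected with HEAD(v)."""
    return tail(u) & head(v)

print(sorted(overlap("PICACA", "CACACA")))  # ['CA', 'CACA']
```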

If $x \in HEAD(v)$, then we can write v = xx', where x' is the word obtained from v by chopping off
its head x. Let's denote x' by v/x. For example SEXYSEX/SEX = YSEX.

Adopting the notation of [GrKP], section 8.4, let's define

$$u : v := \sum_{x \in OVERLAP(u, v)} weight(v / x) ,$$

which is a certain polynomial in s. For example

$$SEXSEX : EXSEXS = s + s^4 ,$$

corresponding to the following two ways in which SEXSEX can be followed by EXSEXS at the
end of a cluster:

     EXSEXS
    SEXSEX

giving rise to the term s, since the leftover is the one-letter S, and

        EXSEXS
    SEXSEX

giving rise to the term $s^4$, since the leftover is the four-letter string SEXS.
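
The polynomial u : v can be computed by scanning the possible overlap lengths. In this Python sketch (the name `colon` and the dictionary representation, mapping each exponent of s to its coefficient, are our own choices), every x in OVERLAP(u, v) contributes one term $s^{length(v) - length(x)}$.

```python
from collections import Counter

def colon(u, v):
    """u : v as {exponent of s: coefficient}."""
    poly = Counter()
    for k in range(1, min(len(u), len(v))):   # k = length of a candidate overlap x
        if u.endswith(v[:k]):                 # x = v[:k] is a proper tail of u and a proper head of v
            poly[len(v) - k] += 1             # contributes weight(v / x) = s^(len(v) - k)
    return dict(poly)

print(colon("SEXSEX", "EXSEXS"))
```

The result for SEXSEX : EXSEXS has coefficient 1 at exponents 1 and 4, i.e. $s + s^4$.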



Now the set of clusters C can be partitioned into

$$C = \bigcup_{v \in B} C[v] ,$$

where C[v] ($v \in B$) is the set of clusters whose last (top) entry is v.

Given a cluster in C[v], it either consists of just v, or else, chopping v results in a smaller cluster that
may end with any bad word u for which OVERLAP(u, v) is non-empty. This means that there is
a word $x \in OVERLAP(u, v)$ for which u = x''x and v = xx', for some non-empty words x'' and x'.
By removing v from the cluster, we lose its tail, x' = v/x, from the underlying word. Conversely,
given a cluster in C[u] and one of the elements x of OVERLAP(u, v), we can reconstitute the
bigger cluster in C[v] by adding v to the end of the cluster, and appending the word v/x to the
underlying word of the cluster.

For example, if once again, V = {E, S, X}, and B = {SEX, ESE, XES}, then the cluster

    S E X E S E X
            S E X
          E S E
        X E S
    S E X

belongs to C[SEX]. Chopping the top SEX results in the smaller cluster

    S E X E S E
          E S E
        X E S
    S E X

that belongs to C[ESE], and so in this example x = SE, and x' = SEX/SE = X.

We have just established a bijection

$$C[v] \leftrightarrow \{ (v, [1, length(v)]) \} \cup \bigcup_{u \in B} C[u] \times OVERLAP(u, v) , \qquad (Set\_Equations)$$

where if a cluster $c \in C[v]$ has more than one bad word, and is mapped by the above bijection to (c', x),
then $weight(c) = (-1) \, weight(c') \, weight(v/x)$.

Taking weights, we have

$$weight(C[v]) = (-1) \, weight(v) - \sum_{u \in B} (u : v) \cdot weight(C[u]) . \qquad (Linear\_Equations)$$

This is a system of |B| linear equations in the |B| unknowns weight(C[v]). Furthermore, it is usually
rather sparse, since for most pairs of bad words u and v, OVERLAP(u, v) is empty. In fact, let's
denote by Comp(v) the set of bad words $u \in B$ for which OVERLAP(u, v) is non-empty; then the
above system can be rewritten:

$$weight(C[v]) = -weight(v) - \sum_{u \in Comp(v)} (u : v) \cdot weight(C[u]) . \qquad (Linear\_Equations')$$

Note that in general |B| is much smaller than $d^R$ (where d is the number of letters in your alphabet
V, and R+1 is the maximal length of a bad word in B), the number of equations in the system of
linear equations required by the naive approach described at the beginning. So the Goulden-Jackson
method is much more efficient, in general.

After solving (Linear_Equations'), we get weight(C) by using

$$weight(C) = \sum_{v \in B} weight(C[v]) ,$$

which we plug into

$$f(s) = \frac{1}{1 - ds - weight(C)} .$$
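
The whole pipeline fits in a few dozen lines of code. The Python sketch below is not the authors' Maple implementation: rather than solving the linear system symbolically for a rational function, it iterates the cluster equations to obtain weight(C) as a truncated power series in s (every term (u : v) has positive degree, so each pass settles at least one more coefficient), and then reads off a(n) from f(s)(1 - ds - weight(C)) = 1. All function names are ours, and the code assumes, as in this section, that no bad word is a proper factor of another.

```python
from itertools import product

def overlap_shifts(u, v):
    """length(v/x) for every x in OVERLAP(u, v): the exponents of s appearing in u : v."""
    return [len(v) - k for k in range(1, min(len(u), len(v))) if u.endswith(v[:k])]

def cluster_series(bad, N):
    """Coefficients of s^0..s^N in weight(C), by iterating the cluster equations."""
    c = {v: [0] * (N + 1) for v in bad}
    for _ in range(N + 1):                          # each pass fixes at least one more power of s
        new = {}
        for v in bad:
            row = [0] * (N + 1)
            if len(v) <= N:
                row[len(v)] -= 1                    # the term -weight(v)
            for u in bad:
                for e in overlap_shifts(u, v):      # the term -(u : v) * weight(C[u])
                    for n in range(e, N + 1):
                        row[n] -= c[u][n - e]
            new[v] = row
        c = new
    return [sum(c[v][n] for v in bad) for n in range(N + 1)]

def gj_counts(d, bad, N):
    """a(0..N), from f(s) = 1 / (1 - d*s - weight(C))."""
    C = cluster_series(bad, N)
    a = []
    for n in range(N + 1):
        val = 1 if n == 0 else d * a[n - 1]
        val += sum(C[k] * a[n - k] for k in range(1, n + 1))
        a.append(val)
    return a

def brute_counts(alphabet, bad, N):
    """Exhaustive check, feasible only for small alphabets and small N."""
    return [sum(not any(b in "".join(t) for b in bad) for t in product(alphabet, repeat=n))
            for n in range(N + 1)]

print(gj_counts(3, ["SEX", "XE"], 6))
print(brute_counts("ESX", ["SEX", "XE"], 6))
```

Note that `gj_counts` needs only the alphabet size d, not the alphabet itself, which is why the method also works for an alphabet of arbitrary (symbolic) size.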

Example: Find the generating function of all words in {A, B, C, ..., X, Y, Z} that avoid the dirty
words PIPI and CACA.

Answer: d = 26, and the system is

(i) $weight(C[PIPI]) = -s^4 - s^2 \, weight(C[PIPI])$
(ii) $weight(C[CACA]) = -s^4 - s^2 \, weight(C[CACA])$

from which

$$weight(C[PIPI]) = weight(C[CACA]) = -s^4/(1 + s^2) ,$$

and hence $weight(C) = -2s^4/(1 + s^2)$, and hence

$$f(s) = \frac{1}{1 - 26s + 2s^4/(1 + s^2)} = \frac{1 + s^2}{1 - 26s + s^2 - 26s^3 + 2s^4} .$$
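
As a sanity check on this answer, one can expand the rational function as a power series and compare the first coefficients with values known by direct counting: every word of length less than 4 is allowed, and exactly two words of length 4 (PIPI and CACA themselves) are forbidden. The Python sketch below (the helper `series` is our own) uses only integer arithmetic, since the denominator has constant term 1.

```python
def series(num, den, N):
    """First N+1 Taylor coefficients of num(s)/den(s).
    num, den: coefficient lists in increasing powers of s, with den[0] == 1."""
    a = []
    for n in range(N + 1):
        c = num[n] if n < len(num) else 0
        c -= sum(den[k] * a[n - k] for k in range(1, min(n, len(den) - 1) + 1))
        a.append(c)
    return a

# f(s) = (1 + s^2) / (1 - 26 s + s^2 - 26 s^3 + 2 s^4)
coeffs = series([1, 0, 1], [1, -26, 1, -26, 2], 5)
print(coeffs)
print(coeffs[4] == 26**4 - 2)
```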



Another Example: Find the generating function of all words in {A, B, C, ..., X, Y, Z} that avoid
the dirty words PIPI, CACA, PICA and CAPI.

Answer: d = 26, and the system is

(i) $weight(C[PIPI]) = -s^4 - s^2 \, weight(C[PIPI]) - s^2 \, weight(C[CAPI])$
(ii) $weight(C[CACA]) = -s^4 - s^2 \, weight(C[CACA]) - s^2 \, weight(C[PICA])$
(iii) $weight(C[PICA]) = -s^4 - s^2 \, weight(C[PIPI]) - s^2 \, weight(C[CAPI])$
(iv) $weight(C[CAPI]) = -s^4 - s^2 \, weight(C[CACA]) - s^2 \, weight(C[PICA])$

from which

$$f(s) = \frac{1 + 2s^2}{1 - 26s + 2s^2 - 52s^3 + 4s^4} .$$
Maple Implementation

A Maple implementation of this is contained in the package DAVID_IAN, downloadable from this
paper's website http://www.math.temple.edu/~zeilberg/gj.html.

The function call is GJs(Alphabet,Set_of_bad_words,s). For example, to get the generating
function $f(x) = \sum_{n=0}^{\infty} a(n) x^n$, where a(n) is the number of ways of spinning a dreidel n times,
without having a run of length 4 of any of Gimel, Heh, Nun, or Shin, do
GJs({G,H,N,S},{[G,G,G,G],[H,H,H,H],[N,N,N,N],[S,S,S,S]},x); .
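
For readers without Maple at hand, the dreidel numbers can be cross-checked independently. Working the cluster equations by hand for this example (our own derivation, not output quoted from the package) gives $weight(C[GGGG]) = -x^4/(1 + x + x^2 + x^3)$ and hence $f(x) = (1 + x + x^2 + x^3)/(1 - 3x - 3x^2 - 3x^3)$, so that a(n) = 3(a(n-1) + a(n-2) + a(n-3)) for $n \ge 4$. The Python sketch below counts the sequences by a different route, tracking the length of the final run, and compares with that recurrence; the function name and parameters are ours.

```python
def dreidel_counts(N, faces=4, max_run=3):
    """a(0..N): sequences of spins in which every run of equal faces has length <= max_run."""
    counts = [1]
    runs = [0] * (max_run + 1)          # runs[r] = sequences whose final run has length r
    for n in range(1, N + 1):
        new = [0] * (max_run + 1)
        if n == 1:
            new[1] = faces
        else:
            new[1] = (faces - 1) * sum(runs)    # start a new run with a different face
            for r in range(2, max_run + 1):
                new[r] = runs[r - 1]            # extend the final run by one spin
        runs = new
        counts.append(sum(runs))
    return counts

a = dreidel_counts(10)
print(a[:6])
print(all(a[n] == 3 * (a[n - 1] + a[n - 2] + a[n - 3]) for n in range(4, 11)))
```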

Penney-Ante

The system of equations (Linear_Equations') is identical to the one occurring in so-called Penney-Ante
games, in which each player picks a word, and a coin (or die), with as many faces as letters,
is tossed (or rolled) until a string matching that of one of the players is encountered, in which case
that player wins. Since the special case of two players and two letters is so beautifully described in [GrKP],
section 8.4, and the general case is just as beautifully described in Guibas and Odlyzko's paper
[GuiO], we will only mention here how to use our Maple implementation. The function call, in the
package DAVID_IAN, is Penney(List_of_letters,List_of_words,Probs). The output is the list of
probabilities of winning corresponding to the list of words List_of_words. Probs is the way the die
is loaded, i.e. the probabilities of the respective letters in the list List_of_letters.

For example, to treat the original example in Walter Penney's paper [P] (see also [GrKP], p. 394),
in which Alice and Bob flip a coin until either HHT or HTT occurs, and Alice wins in the former case
while Bob wins in the latter case, do (in DAVID_IAN), Penney([H,T],[[H,H,T],[H,T,T]],[1/2,1/2]);,
getting the output: [2/3, 1/3]. If the probability of a Head is p, then do
Penney([H,T],[[H,H,T],[H,T,T]],[p,1-p]);, getting $[p/(p^2 - p + 1), (1 - p)^2/(p^2 - p + 1)]$.
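
These probabilities can be reproduced without the cluster machinery. The Python sketch below is an independent cross-check, not a port of the Maple procedure Penney: it treats the game as an absorbing Markov chain whose states are the prefixes of the players' words, and solves the resulting linear system exactly with rational arithmetic. The function name `penney_exact` is ours, and the code assumes that no player's word is a factor of another's.

```python
from fractions import Fraction

def penney_exact(letters, words, probs):
    """Exact winning probability of each word (assumes no word is a factor of another)."""
    words = ["".join(w) for w in words]
    probs = [Fraction(p) for p in probs]
    prefixes = {w[:k] for w in words for k in range(len(w) + 1)}

    def step(state, ch):
        t = state + ch
        while t not in prefixes:          # keep the longest suffix that is a prefix of some word
            t = t[1:]
        return t

    transient = sorted(prefixes - set(words))
    idx = {p: i for i, p in enumerate(transient)}
    n = len(transient)
    answers = []
    for target in words:
        # Row i encodes: h[p] - sum over letters of pr * h[step(p, letter)] = P(step lands on target)
        M = [[Fraction(0)] * (n + 1) for _ in range(n)]
        for p, i in idx.items():
            M[i][i] += 1
            for ch, pr in zip(letters, probs):
                q = step(p, ch)
                if q == target:
                    M[i][n] += pr
                elif q in idx:
                    M[i][idx[q]] -= pr
        for col in range(n):              # Gauss-Jordan elimination over the rationals
            piv = next(r for r in range(col, n) if M[r][col] != 0)
            M[col], M[piv] = M[piv], M[col]
            M[col] = [x / M[col][col] for x in M[col]]
            for r in range(n):
                if r != col and M[r][col] != 0:
                    f = M[r][col]
                    M[r] = [a - f * b for a, b in zip(M[r], M[col])]
        answers.append(M[idx[""]][n])     # probability starting from the empty history
    return answers

print(penney_exact("HT", ["HHT", "HTT"], [Fraction(1, 2), Fraction(1, 2)]))
```

With a biased coin, say p = 1/3, the same call returns 3/7 and 4/7, in agreement with the formula above.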

In order to check the validity of Penney, we have also written a procedure PenneyGames that
simulates many Penney-Ante games, and gives the scores of each player. The function call is
PenneyGames(List_of_letters,List_of_words,Probs,K), where K is the number of individual
games. Thus typing PenneyGames([H,T],[[H,H,T],[H,T,T]],[1/2,1/2],300); should give something
close to [200,100], but the exact outcome changes, of course, for each new batch of 300
games, according to the whims of Lady Luck.
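
A simulation in the same spirit is easy to write in any language. The following Python sketch mimics what PenneyGames is described as doing (it is our own illustration, not a translation of the Maple code; the `seed` parameter is an addition that makes runs reproducible).

```python
import random

def penney_games(letters, words, probs, K, seed=None):
    """Play K Penney-Ante games; return how many games each word won."""
    rng = random.Random(seed)
    words = ["".join(w) for w in words]
    longest = max(len(w) for w in words)
    scores = [0] * len(words)
    for _ in range(K):
        recent = ""                       # only the last `longest` tosses matter
        while True:
            recent = (recent + rng.choices(letters, weights=probs)[0])[-longest:]
            winner = next((k for k, w in enumerate(words) if recent.endswith(w)), None)
            if winner is not None:
                scores[winner] += 1
                break
    return scores

print(penney_games("HT", ["HHT", "HTT"], [0.5, 0.5], 300))
```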

Be sure to try also BestLastPlay, which will tell you the best counter-move.

Keeping Track of the Number of Bad Words

Almost nobody is perfect. It is extremely unlikely that a long word would contain no bad factors.
A more general question is to find the number of words $a_m(n)$ in the alphabet V with exactly m
occurrences of factors that belong to B. Let's define the generating function

$$F(s, t) := \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} a_m(n) s^n t^m .$$

F(s, t) generalizes f(s) since, obviously, f(s) = F(s, 0).



The above analysis goes almost verbatim. Now we have:

$$F(s, t) = \sum_{w \in V^*} weight(w) \, t^{[\text{number of factors of } w \text{ that belong to } B]} ,$$

and then use the following deep facts:

$$t = 1 + (t - 1) ,$$

and for any finite set A,

$$\prod_{a \in A} t = \prod_{a \in A} (1 + (t - 1)) = \sum_{S \subset A} (t - 1)^{|S|} ,$$

where as usual, |S| denotes the cardinality of S.
We now have,

$$F(s, t) = \sum_{w \in V^*} weight(w) \, t^{[\text{number of factors of } w \text{ that belong to } B]}
= \sum_{w \in V^*} weight(w) \, (1 + (t - 1))^{[\text{number of factors of } w \text{ that belong to } B]}
= \sum_{w \in V^*} \sum_{S \subset Bad(w)} (t - 1)^{|S|} s^{length(w)} ,$$

where Bad(w) is the set of factors of w that belong to B.
The set of linear equations (Linear_Equations') now becomes:

$$weight(C[v]) = (t - 1) \, weight(v) + (t - 1) \sum_{u \in Comp(v)} (u : v) \cdot weight(C[u]) , \qquad (Linear\_Equations'')$$

and the rest stays the same.
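
The two-variable version can be tested the same way as the basic one. In the Python sketch below (again our own illustration, not the Maple code) we write z = t - 1 and store the coefficient of each $s^n$ as a polynomial in z, represented as a dictionary from powers of z to coefficients. The cluster equations are iterated to obtain weight(C) as a truncated series, F(s, t) is read off from F(1 - ds - weight(C)) = 1, and the result is compared with exhaustive enumeration, where a word with m bad factors contributes $t^m = (1 + z)^m$. The code assumes no bad word is a proper factor of another.

```python
from collections import Counter
from itertools import product
from math import comb

def overlap_shifts(u, v):
    """Exponents of s in u : v."""
    return [len(v) - k for k in range(1, min(len(u), len(v))) if u.endswith(v[:k])]

def F_series(d, bad, N):
    """Coefficient of s^n in F(s, t), n = 0..N, each as {power of z: coefficient}, z = t - 1."""
    cl = {v: [Counter() for _ in range(N + 1)] for v in bad}
    for _ in range(N + 1):                        # iterate the cluster equations
        new = {}
        for v in bad:
            row = [Counter() for _ in range(N + 1)]
            if len(v) <= N:
                row[len(v)][1] += 1               # (t - 1) * weight(v) = z * s^len(v)
            for u in bad:
                for e in overlap_shifts(u, v):
                    for n in range(e, N + 1):
                        for p, c in cl[u][n - e].items():
                            row[n][p + 1] += c    # (t - 1) * s^e * weight(C[u])
            new[v] = row
        cl = new
    total = [Counter() for _ in range(N + 1)]     # weight(C) = sum over v of weight(C[v])
    for v in bad:
        for n in range(N + 1):
            total[n].update(cl[v][n])
    F = []
    for n in range(N + 1):                        # F_n = [n = 0] + d F_{n-1} + sum_k C_k F_{n-k}
        cur = Counter({0: 1}) if n == 0 else Counter()
        if n >= 1:
            for p, c in F[n - 1].items():
                cur[p] += d * c
        for k in range(1, n + 1):
            for p1, c1 in total[k].items():
                for p2, c2 in F[n - k].items():
                    cur[p1 + p2] += c1 * c2
        F.append(cur)
    return F

def F_brute(alphabet, bad, n):
    """Sum over all n-letter words of (1 + z)^(number of bad factors), as {power of z: coeff}."""
    poly = Counter()
    for tpl in product(alphabet, repeat=n):
        w = "".join(tpl)
        m = sum(w.startswith(b, i) for b in bad for i in range(n))
        for p in range(m + 1):
            poly[p] += comb(m, p)
    return poly

clean = lambda poly: {p: c for p, c in poly.items() if c}
print(clean(F_series(2, ["GBG"], 5)[5]), clean(F_brute("BG", ["GBG"], 5)))
```

To recover the numbers $a_m(n)$ themselves, substitute z = t - 1 and expand; setting z = -1 (that is, t = 0) gives back the counts of the basic method.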

Maple Implementation: In the package DAVID_IAN, the function that finds F(s, t) is GJst. For
example let a(n, m) be the number of ways of arranging n children in a line in such a way that
exactly m boys are isolated (surrounded by girls on both sides, see [CoGuy], p. 205). To find the
generating function $F(s, t) = \sum_{n=0}^{\infty} \sum_{m=0}^{\infty} a(n, m) s^n t^m$, do GJst({B,G},{[G,B,G]},s,t); .

Keeping Track of the Individual Counts of Each Obscenity 

Suppose we want to know how many words of length n have $m_1$ occurrences of $b_1$, $m_2$ occurrences
of $b_2$, ..., $m_f$ occurrences of $b_f$, where the set of bad words is $B = \{b_1, b_2, \dots, b_f\}$. We need to keep
track of the individuality of each bad word. Introducing the variable t[b] for each bad word $b \in B$,
we now require

$$F(s; t[b_1], \dots, t[b_f]) := \sum_{w \in V^*} weight(w) \prod_{b \text{ is a bad factor of } w} t[b] ,$$

and then use the following:

$$t[b] = 1 + (t[b] - 1) ,$$

and for any finite set A,

$$\prod_{a \in A} t[a] = \prod_{a \in A} (1 + (t[a] - 1)) = \sum_{S \subset A} \prod_{a \in S} (t[a] - 1) .$$

We now have,

$$F(s; t[b_1], \dots, t[b_f]) = \sum_{w \in V^*} weight(w) \prod_{b \text{ is a bad factor of } w} t[b]
= \sum_{w \in V^*} \sum_{S \subset Bad(w)} \left( \prod_{b \in S} (t[b] - 1) \right) s^{length(w)} ,$$

where Bad(w) is the set of factors of w that belong to B.
The set of linear equations (Linear_Equations'') now becomes:

$$weight(C[v]) = (t[v] - 1) \cdot weight(v) + (t[v] - 1) \cdot \sum_{u \in Comp(v)} (u : v) \cdot weight(C[u]) , \qquad (Linear\_Equations''')$$

and the rest stays the same.

Maple Implementation: In the package DAVID_IAN, the function that finds $F(s; t[b_1], \dots, t[b_f])$
is GJstDetail. For example, the number of ways of arranging n kids in a line such that there are
a isolated boys and b isolated girls is the coefficient of $s^n \, t[G,B,G]^a \, t[B,G,B]^b$ in the Maclaurin
expansion of the rational function GJstDetail({B,G},{[G,B,G],[B,G,B]},s,t); .

Keeping Track of the Letters as well 

If you want to know the above information, but also wish to know the individual count of the
letters, do exactly as above, with the only difference that weight(w) is no longer simply $s^{length(w)}$,
but rather (if $w = w_1 \dots w_n$):

$$weight(w) := \prod_{i=1}^{n} x[w_i] .$$

(For example $weight(ESSEX) = x[E]^2 x[S]^2 x[X]$.) The function calls are GJgf and GJgfDetail.
We refer the reader to the on-line documentation in the package DAVID_IAN for instructions.

Generalizing to the Case of an Arbitrary Set of Bad Words 

What happens if we remove the condition, on the set of bad words B, that no bad word can be a
proper factor of another bad word? As we saw above, if all we want is the generating function for
the number of n-letter words that avoid (as factors) the members of B, then we can easily remove
all members of B that have another member of B as a factor, until we get a set of banned words
B', that meets the above condition, and that gives the same enumeration. So, as far as applying
GJs in DAVID_IAN, i.e. finding the generating function f(s), we don't need to generalize.

But if we are interested in the more general F(s, t), i.e. in GJst, then the original Cluster method
fails. We will now describe how to modify it.

Everything goes as before, but now the clusters look different. Given a marked word
$(w_1 \dots w_n; [i_1, j_1], [i_2, j_2], \dots, [i_l, j_l])$, we may no longer assume that $j_1 < j_2 < \dots < j_l$, only that
$j_1 \le j_2 \le \dots \le j_l$, and now we may have nesting, i.e. it is possible to have $i_r < i_s < j_s < j_r$ for
some s < r. Since the second component of a marked word $(w_1 \dots w_n; [i_1, j_1], [i_2, j_2], \dots, [i_l, j_l])$ is
a set, we may arrange the $[i_r, j_r]$ in such a way that $j_r \le j_{r+1}$ for $r = 1, \dots, l - 1$, and if $j_r = j_{r+1}$
then $i_r < i_{r+1}$. For example if B = {AC, CA, CACA, ICAC, TICA, TIT, TI} then the following
marked word is a cluster:

    T I T I C A C A
                C A
            C A C A
              A C
          I C A C
        T I C A
    T I T
    T I

In one-dimensional notation it is written: (TITICACA; [1,2], [1,3], [3,6], [4,7], [6,7], [5,8], [7,8]).

In the original case, it was easy to enumerate clusters, since removing the rightmost (i.e. top) bad
word resulted in a smaller cluster. This is no longer true. We are hence forced to introduce the
larger set of committed clusters.

The above marked word is a member of C[CA]. Chopping the rightmost bad factor, CA, still leaves a
cluster:

    T I T I C A C A
            C A C A
              A C
          I C A C
        T I C A
    T I T
    T I

which belongs to C[CACA], but note that the underlying word has not changed, so the weight
stays the same, except for a factor of (t - 1). If we chop the rightmost factor again, which is now
CACA, we get the following cluster

    T I T I C A C
              A C
          I C A C
        T I C A
    T I T
    T I

which belongs to C[AC], BUT, unlike the previous scenario, in which ANY cluster in C[AC] could
have been gotten, now we MUST have the 3rd letter from the end be a C.

Such a situation occurs whenever we have $u, v \in B$ such that v = xuy, where both x and y are
non-empty words in the alphabet V. For each such pair, we introduce the set C'[x, u], which is the
set of clusters whose rightmost bad word is u, and whose underlying word ends with xu. Now we
have many more unknowns and many more equations; we set them up in an analogous way. But
at the end, after solving the system, when we compute weight(C), we only sum the weight(C[v]), and
ignore all the weight(C'[x, u]). Note that the weight(C'[x, u]) play the role of catalysts, that enable
the chemical reaction, but at the end are discarded.

We leave it to the readers to fill in the details. The readers may get a clue from examining the
Maple implementation JODO, that does the job, and which we will now describe.

JODO: The Maple implementation of the Generalized Cluster Method 

The main routine is GJNZst, which computes the generating function F(s, t). For example, to find the
number of 10-letter words in the alphabet {P, I} containing exactly 13 factors that are either PI or
PIPI, take the coefficient of $s^{10} t^{13}$ in the Taylor expansion of GJNZst({I,P},{[P,I],[P,I,P,I]},s,t); .
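
For small word lengths, the coefficients that GJNZst is described as producing can be tabulated by exhaustive enumeration, which gives a way to spot-check any output. The Python sketch below (our own helper, unrelated to the JODO source) counts, for each n-letter word, all positions at which some bad word occurs; bad words nested inside other bad words are counted separately, as in this section.

```python
from collections import Counter
from itertools import product

def occurrence_histogram(alphabet, bad, n):
    """hist[m] = number of n-letter words with exactly m factors (counted by position) in bad."""
    hist = Counter()
    for tpl in product(alphabet, repeat=n):
        w = "".join(tpl)
        hist[sum(w.startswith(b, i) for b in bad for i in range(n))] += 1
    return hist

hist = occurrence_histogram("IP", ["PI", "PIPI"], 10)
for m in sorted(hist):
    print(m, hist[m])
```

Here `hist[m]` plays the role of the coefficient of $s^{10} t^{m}$.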

An Interesting Application of JODO to Counting Runs 

A run in a word is a string of a repeated letter. Given a set of bad words B, it is of interest to
know how many words there are avoiding the members of B as factors and having a specified number of maximal
runs. It is also of interest to know the average number of maximal runs. It can be shown that for
any finite set of bad words B, the average number of runs in an n-letter word avoiding the words
of B as factors is asymptotically C(B)n, where C(B) is a certain algebraic number that depends,
of course, on B.

Note that a new maximal run starts whenever we have an occurrence of any two-letter word ab,
with $a \ne b$. So all we have to do is append to B these words, giving them the variable t, and then
use a variant of GJNZ to find the generating function. The relevant functions are Runs and AvRuns.
The implementation details may be found in the package.
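
The observation that runs can be counted through two-letter factors is easy to confirm. In this small Python sketch (function names are ours), the number of maximal runs of a non-empty word is one plus the number of occurrences of two-letter factors ab with a different from b, and this is compared with a direct count.

```python
from itertools import groupby

def runs_via_two_letter_factors(word):
    """1 + number of positions where a two-letter factor ab with a != b occurs."""
    return (1 if word else 0) + sum(a != b for a, b in zip(word, word[1:]))

def runs_direct(word):
    """Number of maximal blocks of equal consecutive letters."""
    return sum(1 for _ in groupby(word))

print(runs_via_two_letter_factors("AABBBAC"), runs_direct("AABBBAC"))  # 4 4
```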

Generalizing to Non-Consecutive Bad Words 

So far, we wanted to avoid factors, i.e. the occurrence of a bad word occurring as consecutive letters.
Suppose we want to avoid SEX but also the possibility that SEX would appear when the letters
are separated by one place, i.e., in addition to SEX, we don't want factors of the form S?E?X,
or S?EX, or SE?X, where a question-mark could stand for any character. Hence SHEXY would
be censored as would ASELX, but ASHOEOOX would be allowed. In other words, we want to
include as our set of bad words, words including a blank, where, for example, [T, BL, T] means that
whenever two T's are separated by exactly one letter, we count it as a bad word. The analysis goes
almost verbatim, and the details can be found by examining the source code of the Maple package
BLANKS, that is yet another Maple package that accompanies this paper.



The Maple Package BLANKS 

The principal routines are BLANKSst and BLANKSsO. The function calls are BLANKSst (alphabet , 
BL, MISTAKES) and BLANKSsO (alphabet , BL, MISTAKES), where alphabet is the set of letters, 
BL is the symbol denoting the blank, and MISTAKES is the set of bad words, that are lists in the 
alphabet V U {BL}. 

For example, to find the generating function 

$F(s,t) = \sum_{n \ge 0} \sum_{m \ge 0} a(n,m) s^n t^m$ 

for a(n,m), the number of 0-1 sequences of length n, $w = w_1, \dots, w_n$, that have exactly m 
occurrences of either $w_i = w_{i+1} = w_{i+2}$ or $w_i = w_{i+2} = w_{i+4}$ or $w_i = w_{i+3} = w_{i+6}$, type, in BLANKS, 
BLANKSst({0,1},B,{[0,0,0],[1,1,1],[0,B,0,B,0],[1,B,1,B,1],[0,B,B,0,B,B,0], 
[1,B,B,1,B,B,1]}). 

If you want F(s,0), the generating function for a(n) := the number of 0-1 sequences of length n 
with none of the above (i.e. the number of ways of 2-coloring the integers {1, 2, ..., n} such that 
you don't have a monochromatic arithmetic sequence of length 3 and difference ≤ 3), then type: 
BLANKSs0({0,1},B,{[0,0,0],[1,1,1],[0,B,0,B,0],[1,B,1,B,1],[0,B,B,0,B,B,0], 
[1,B,B,1,B,B,1]}). 
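
For small n, the coefficients a(n,m) can be cross-checked by exhaustive enumeration. The Python sketch below is such a check under stated assumptions: it does not use the cluster method or the BLANKS package, and the names MISTAKES, count_occurrences and a_table are hypothetical. It enumerates all 2^n words, so it is only practical for small n.

```python
from itertools import product

B = "B"  # blank symbol, written as in the example calls
MISTAKES = [
    [0, 0, 0], [1, 1, 1],
    [0, B, 0, B, 0], [1, B, 1, B, 1],
    [0, B, B, 0, B, B, 0], [1, B, B, 1, B, B, 1],
]

def count_occurrences(word, patterns):
    """Total number of occurrences, over all patterns, as factors of `word`."""
    total = 0
    for p in patterns:
        k = len(p)
        for i in range(len(word) - k + 1):
            # A blank matches any letter; other entries must agree exactly.
            if all(x == B or x == c for x, c in zip(p, word[i:i + k])):
                total += 1
    return total

def a_table(n):
    """Return {m: a(n, m)} by enumerating all 0-1 words of length n."""
    table = {}
    for w in product((0, 1), repeat=n):
        m = count_occurrences(w, MISTAKES)
        table[m] = table.get(m, 0) + 1
    return table

# a(n) = a(n, 0): words with no monochromatic 3-term progression of difference <= 3.
for n in range(8):
    print(n, a_table(n).get(0, 0))
```

The m = 0 entries give the coefficients of F(s,0), and the full table for a fixed n gives the coefficient of s^n in F(s,t) as a polynomial in t.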

Exploiting Symmetry 

Often the set of bad words is invariant either under the action of the symmetric group (in the case when 
the alphabet is, say, {1, 2, ..., n}), or under the action of the group of signed permutations (when 
the alphabet is {-1, 1, -2, 2, ..., -n, n}). Then by symmetry, the Cluster generating functions 
weight(C[w]) only depend on the equivalence class of w, and there are many fewer equations, 
and many fewer unknowns. The two Maple packages SYMGJ and SPGJ implement these two cases 
respectively. We refer the readers to the on-line documentation for details. 

Series Expansions 

Many times the set of equations is too big for Maple to solve exactly. Nevertheless, using the set 
of equations (Linear_Equations') or its analogs, we can iteratively get series expansions for the 
Cluster generating function, and hence for the generating function itself, to any desired number 
of terms. The procedure GJseries in DAVID_IAN handles this. The package GJseries is a more 
efficient implementation of these ideas. 

Applications 

The application to Self-Avoiding Walks (see [MS] for a very readable introduction to this subject) is 
described in [N]. The package GJSAW, which also comes with this paper, is a targeted implementation. 






Another application is to the computation of the number of ternary square-free words (e.g. [B], [Cu]), 
which are sequences in the alphabet {1, 2, 3} that do not contain a 'square', i.e. a factor of the form 
uu where u is a word of any length. As such, the set of bad words, B, is infinite, and the present 
theory would have to be extended. However, we can find upper bounds and exact series expansions 
by limiting the length of u. In particular, taking the set of bad words to be uu, where u is of length 
≤ 23, the first 48 terms of the sequence counting the n-letter words in the alphabet {1, 2, 3} 
that avoid uu with length(u) ≤ 23 coincide with the first 48 terms of the real thing (i.e. a(0) 
through a(47), where a(n) is the number of n-letter ternary square-free words), and using GJsqfree 
(which is a Maple package targeted to deal with square-free words), we were able to extend 
sequence M2550 of [SP] to 48 terms: 

M2550: 1, 3, 6, 12, 18, 30, 42, 60, 78, 108, 144, 204, 264, 342, 456, 618, 798, 1044, 1392, 1830, 
2388, 3180, 4146, 5418, 7032, 9198, 11892, 15486, 20220, 26424, 34422, 44862, 58446, 76122, 99276, 
129516, 168546, 219516, 285750, 372204, 484446, 630666, 821154, 1069512, 1392270, 1812876, 
2359710, 3072486. 

It is well known and easy to see (e.g. [MS], p. 9) that the obvious inequality $a(n+m) \le a(n)a(m)$ 
implies that $\mu := \lim_{n \to \infty} a(n)^{1/n}$ exists. 

Using Zinn-Justin's method, described in [Gut], we were able to estimate that $\mu \approx 1.302$, and that 
if, as is reasonable to conjecture, $a(n) \sim \mu^n n^{\theta}$, then $\theta \approx 0$. 
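
As a rough numerical sanity check (this is not Zinn-Justin's method, only elementary arithmetic on the terms a(0), ..., a(47) listed above), the Python sketch below computes the n-th roots a(n)^(1/n) and the successive ratios a(n)/a(n-1). By the submultiplicative inequality, every a(n)^(1/n) is an upper bound for the limit, while the ratios are merely a heuristic estimate. The variable names are hypothetical.

```python
# Terms a(0), ..., a(47) of the sequence of ternary square-free word counts.
A = [1, 3, 6, 12, 18, 30, 42, 60, 78, 108, 144, 204, 264, 342, 456, 618,
     798, 1044, 1392, 1830, 2388, 3180, 4146, 5418, 7032, 9198, 11892,
     15486, 20220, 26424, 34422, 44862, 58446, 76122, 99276, 129516,
     168546, 219516, 285750, 372204, 484446, 630666, 821154, 1069512,
     1392270, 1812876, 2359710, 3072486]

# n-th roots: each one bounds the limit from above (submultiplicativity).
roots = [A[n] ** (1.0 / n) for n in range(1, len(A))]

# Successive ratios: a crude, non-rigorous estimate of the growth rate.
ratios = [A[n] / A[n - 1] for n in range(1, len(A))]

print("a(47)^(1/47) =", roots[-1])
print("a(47)/a(46)  =", ratios[-1])
```

The last ratio is close to the estimate 1.302 quoted above, whereas the n-th roots converge far more slowly, which is one reason a series-extrapolation method is used for the actual estimate.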

Hence we have ample evidence for the following: 

Conjecture: The number of n-letter square-free ternary words is given, asymptotically, by $a(n) \sim 
C\mu^n$, where $\mu := \lim_{n \to \infty} a(n)^{1/n}$. 

In [B] it is shown that $2^{1/24} \le \mu$, and the upper bound $\mu < 1.316$ is stated. Using the 
series expansion for 'finite-memory' (memory 23) square-free words, as above, we found the sharper 
upper bound $\mu < 1.30201064$. 

The Maple package GJsqfree 

The Maple package GJsqfree, which is also available from this paper's website, is a targeted 
implementation for the case of counting square-free words. The main procedure is Series, which gives the 
first NUTERMS + 1 terms of the sequence enumerating the number of words in an alphabet of 
DIM letters that avoid factors of the form uu, where the length of u is ≤ MEMO. In particular, the 
first 2(MEMO+1) terms of this sequence coincide with those of the sequence of square-free words. 
The function call is: Series(MEMO,DIM,NUTERMS); For example, to get the sequence above, we 
entered Series(23,3,47); 
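
The 'finite-memory' idea can be illustrated by brute force. The Python sketch below is not how GJsqfree works (that package uses the cluster method); it is a plain backtracking enumeration, and the function name series, which merely mirrors the argument order of Series, is an assumption made for this illustration. It is feasible only for a small number of terms.

```python
def series(memo, dim, nuterms):
    """Counts, for n = 0..nuterms, of words over `dim` letters that avoid
    every factor uu with 1 <= len(u) <= memo (brute-force backtracking)."""
    counts = [0] * (nuterms + 1)

    def ends_in_short_square(w):
        # Only squares ending at the last letter need checking, because
        # every proper prefix of w has already been verified.
        n = len(w)
        return any(w[n - 2 * k:n - k] == w[n - k:]
                   for k in range(1, min(memo, n // 2) + 1))

    def extend(w):
        counts[len(w)] += 1
        if len(w) == nuterms:
            return
        for letter in range(dim):
            w.append(letter)
            if not ends_in_short_square(w):
                extend(w)
            w.pop()

    extend([])
    return counts

print(series(10, 3, 8))  # memory exceeds n/2 here, so these are true square-free counts
print(series(1, 3, 5))   # memory 1: only factors aa are forbidden
```

With memory 1 the first 2(1+1) = 4 terms agree with the true square-free counts and the fifth already differs, in line with the 2(MEMO+1) statement above.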